Kafka Monitoring: 
What Matters! 


Amrit Sarkar 


THIS IS NOT A CONTRIBUTION 


Agenda 


* Kafka Basics
* Performance Areas
* Need for Observability
  o Monitoring Options
* Performance classification around Components
* Kafka Consumer Lag evaluation
  o absolute to relative
* Trend Analysis


Kafka Basics 


Kafka moves data between producers (writers) and consumers (readers), 
with data protection, high availability, low latency at high scale! 
Use cases: Metrics, Log Aggregation Solution & Stream Processing 
Brokers use Zookeeper (ZK) to manage and share state*

* ZK has been deprecated / removed in newer versions


Kafka Basics: Kafka Cluster

[Diagram: Kafka cluster showing Topic Partition Replicas, Leader Partition Replicas, the Controller Broker and Zookeeper*]

Controller Broker
* Create and Delete Topics
* Partition States & Leader reassignment


Performance Areas

* Throughput & Latency
  o Production Rates
  o Consumption Rates
  o Consumers' Lag
* Data Integrity
  o Reads Confirmation
  o Writes Confirmation
* Fault Tolerance
  o No business impact on failure
* Resource Usage
Why do we need Observability?


Pre-built dashboards monitor and alert for anticipated future performance issues. 


Explore and quickly identify unanticipated issue root causes in an observability scenario. 


* Kafka doesn't self-report problems, it reports metrics.
* Monitoring tells you that metrics represent a problem.
* Observability guides you to fix the problem.


Active Controller Count Servers restart, hostname 
Performance classification around Components

Each component acts as a potential factor in the performance of the Kafka messaging system.

* Producers
  o Rate
  o Transmission (Network) health
  o Capacity


Monitoring Options - Getting Metrics In

[Diagram: Kafka Cluster with Broker1 and Broker2]

* Confluent Control Center
* KafDrop
* Yahoo Kafka Manager
* Cruise Control
* Kafka Monitor
* Kafka Tool
Producer: Rate

* send() function call (or similar) to push data ---> RecordMetadata object
  o offset() function returns a LONG: the offset of the record in the topic-partition.
* Offset value can be pushed to a metrics store for visualisation.

[Chart: Producer Rate]
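
As a rough sketch of how those offsets can become a rate metric, the snippet below takes timestamped offsets for one topic-partition (the kind of value offset() hands back after each send) and derives messages per second. The function name `producer_rate` and the sample values are illustrative assumptions, not part of any Kafka client API, and it assumes offsets advance by one per record.

```python
def producer_rate(samples):
    """Messages per second between the first and last (timestamp_s, offset) sample."""
    if len(samples) < 2:
        return 0.0
    (t0, first_offset), (t1, last_offset) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        return 0.0
    # Offset growth over the window approximates records written to the partition.
    return (last_offset - first_offset) / elapsed


# Hypothetical samples: (seconds since start, offset reported for the sent record)
samples = [(0.0, 134), (60.0, 144), (120.0, 154)]
rate = producer_rate(samples)
print(f"{rate:.3f} msg/s")
```

The result is the write rate of the whole partition as observed by this producer, since offsets also advance when other producers write to it.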
Producer: Compression & Latency

* Bigger batches ---> higher throughput, lower compression rate
* Small enough to keep GC in check (<< 10 MB)

Ideally

* Batch Size in Bytes ---> Optimally High
  kafka.producer:type=producer-metrics,client-id="{client-id}"
  batch-size-avg
* Compression Rate ---> LOW
  kafka.producer:type=producer-metrics,client-id="{client-id}"
  compression-rate-avg
* Request Latency ---> LOW
  kafka.producer:type=producer-metrics,client-id="{client-id}"
  request-latency-avg
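
To illustrate why batch size and compression rate move together, here is a small sketch that compresses a small and a large batch of similar records and compares the ratio of compressed to uncompressed size (lower is better). It uses zlib from the standard library as a stand-in for whatever codec the producer is configured with, and the record shape is made up for the example.

```python
import json
import zlib


def compression_rate(records):
    """Compressed size divided by uncompressed size for one batch of records."""
    raw = b"".join(records)
    return len(zlib.compress(raw)) / len(raw)


# Hypothetical metric-style records with a lot of repeated structure.
records = [
    json.dumps({"host": f"web-{i % 4}", "metric": "cpu", "value": i % 100}).encode()
    for i in range(1000)
]

small_batch_rate = compression_rate(records[:2])
large_batch_rate = compression_rate(records[:500])
print(f"small batch: {small_batch_rate:.2f}, large batch: {large_batch_rate:.2f}")
```

The larger batch gives the compressor more repeated content to work with, which is the effect the compression-rate-avg metric is meant to surface.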


The Big Four - Key Metrics (JMX)

* Number of active controllers, must be = 1
  kafka.controller:name=ActiveControllerCount,type=KafkaController
* Number of under min ISR partitions, must be = 0
  kafka.server:name=UnderMinIsrPartitionCount,type=ReplicaManager
  Check out the 'UnderReplicatedPartitions' metric too
* Number of offline partitions, must be = 0
  kafka.controller:name=OfflinePartitionsCount,type=KafkaController
* Consumer Lag (per partition)
  kafka.consumer:name=MaxLag,type=ConsumerFetcherManager,clientId=([-.\w]+)
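
The rules above can be sketched as a simple check over values already scraped from JMX. The input layout, the function name `big_four_alerts` and the lag threshold are assumptions made for illustration; the slide gives no threshold for consumer lag, so it is passed in explicitly.

```python
def big_four_alerts(brokers, max_consumer_lag, lag_threshold):
    """Return a list of human-readable violations of the four key rules.

    brokers maps a broker id to the metric values scraped from that broker.
    """
    alerts = []

    # Exactly one broker in the cluster should report itself as controller.
    active = sum(b["ActiveControllerCount"] for b in brokers.values())
    if active != 1:
        alerts.append(f"ActiveControllerCount is {active}, expected 1")

    under_min_isr = sum(b["UnderMinIsrPartitionCount"] for b in brokers.values())
    if under_min_isr != 0:
        alerts.append(f"UnderMinIsrPartitionCount is {under_min_isr}, expected 0")

    offline = sum(b["OfflinePartitionsCount"] for b in brokers.values())
    if offline != 0:
        alerts.append(f"OfflinePartitionsCount is {offline}, expected 0")

    # lag_threshold is a hypothetical, deployment-specific limit.
    if max_consumer_lag > lag_threshold:
        alerts.append(f"MaxLag is {max_consumer_lag}, above {lag_threshold}")

    return alerts


healthy = {
    "broker-0": {"ActiveControllerCount": 1, "UnderMinIsrPartitionCount": 0, "OfflinePartitionsCount": 0},
    "broker-1": {"ActiveControllerCount": 0, "UnderMinIsrPartitionCount": 0, "OfflinePartitionsCount": 0},
}
print(big_four_alerts(healthy, max_consumer_lag=20, lag_threshold=1000))
```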


The Big Four - Key Metrics (JMX)

[Dashboard panels: Kafka Active Controller Count, Kafka Offline Partitions Count and Kafka Under Min In-Sync Replicas Partitions, shown per broker (broker-0, broker-1)]
Brokers' Health

* Load Skewness (number of partitions on a broker)
  kafka.server:name=PartitionCount,type=ReplicaManager
* Network Request & Error Rate
  kafka.network:name=RequestsPerSec/ErrorsPerSec,type=RequestMetrics
* Log Flush Latency
  kafka.log:name=LogFlushRateAndTimeMs,type=LogFlushStats
* Fetcher Lag (per topic per partition)
  kafka.server:name=ConsumerLag,type=FetcherLagMetrics,clientId=([-\w]+),topic=([-\w]+),partition=([0-9]+)
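
One way to turn the per-broker PartitionCount values into a load-skewness signal is to compare each broker against the cluster mean. This is only a sketch: the function name `load_skew` and the sample counts are invented, and what ratio counts as "too skewed" is left to the operator.

```python
def load_skew(partition_counts):
    """Map each broker to its PartitionCount divided by the cluster mean."""
    if not partition_counts:
        return {}
    mean = sum(partition_counts.values()) / len(partition_counts)
    if mean == 0:
        return {broker: 0.0 for broker in partition_counts}
    return {broker: count / mean for broker, count in partition_counts.items()}


# Hypothetical PartitionCount readings, one per broker.
counts = {"broker-0": 120, "broker-1": 80, "broker-2": 100}
skew = load_skew(counts)
for broker, ratio in skew.items():
    print(f"{broker}: {ratio:.2f}x of mean")
```

A ratio well above 1.0 marks a broker carrying more partitions than its peers.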
Consumers: Commit Rate

Too Frequent Commit Rate
* Increased Network Overhead
* Increased Load on Broker

Ideally
* Commit after a Batch of Messages (Less Frequent Rate)
* commitAsync ---> Improve Throughput
* Tune 'auto.commit.interval.ms'
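
The "commit after a batch" idea can be sketched without any Kafka client at all. In the snippet below, `commit_fn` is a hypothetical callback standing in for the real commit call (for example an asynchronous commit), and `BatchCommitter` is an illustrative helper, not a library class.

```python
class BatchCommitter:
    """Calls commit_fn once per batch_size processed messages instead of per message."""

    def __init__(self, commit_fn, batch_size):
        self.commit_fn = commit_fn
        self.batch_size = batch_size
        self.pending = 0
        self.last_offset = None

    def record_processed(self, offset):
        self.pending += 1
        self.last_offset = offset
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            # The committed offset is the next offset to read, hence +1.
            self.commit_fn(self.last_offset + 1)
            self.pending = 0


commits = []  # stands in for commit requests sent to the broker
committer = BatchCommitter(commits.append, batch_size=100)
for offset in range(1000):
    committer.record_processed(offset)
committer.flush()  # commit whatever is left, e.g. on shutdown
print(len(commits), "commits for 1000 messages")
```

The trade-off is that a crash between commits means up to one batch of messages gets reprocessed.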


Consumption Rate

The __consumer_offsets topic can be consumed; the offset (a long value) is emitted as a metric to visualise every consumer-partition's committed offset in near real-time.
See Burrow (discussed later), which already does this.
Lag, Evaluations & Alerts: Burrow

* Monitoring tool that provides consumer lag checking as a service
* Exposes offset lag for every consumer-partition combination as Prometheus metrics
* Monitors committed offsets and calculates the status of those consumers on demand
* Able to send alerts

[Chart: Kafka Consumer Lag]
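
For reference, offset lag per consumer-partition is just the gap between the partition's latest offset and the consumer group's committed offset. The sketch below shows that arithmetic on plain dictionaries; it is not Burrow's code, and the names and sample offsets are assumptions.

```python
def offset_lag(log_end_offsets, committed_offsets):
    """Offset lag per (topic, partition) for one consumer group."""
    lag = {}
    for tp, committed in committed_offsets.items():
        if tp not in log_end_offsets:
            continue  # no broker-side offset known for this partition yet
        # Clamp at zero: the two readings are taken at slightly different times.
        lag[tp] = max(log_end_offsets[tp] - committed, 0)
    return lag


# Hypothetical readings keyed by (topic, partition).
log_end = {("orders", 0): 154, ("orders", 1): 90}
committed = {("orders", 0): 134, ("orders", 1): 90}
lags = offset_lag(log_end, committed)
print(lags)
```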


Lag - Offsets Trend Evaluation

Status of consumer: OK
Lag Series with no Uptrend & Consumer Offset Series not Stalled


Lag - Consumer is Slow!

Status of consumer: WARNING
Lag Series with Uptrend & Consumer Offset Series not Stalled


Lag - Consumer Stalled

[Charts: Kafka Consumer Lag, Consumer Kafka Offset, Hourly Offset Rate]

Status of consumer: STALLED
Lag Series with Uptrend & Consumer Offset Series Stalled
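
The three statuses can be sketched as a small classifier over a window of samples. This is a deliberate simplification of the rules on these slides, not Burrow's actual evaluation: "uptrend" is assumed to mean the lag is higher at the end of the window than at the start, "stalled" is assumed to mean the committed offset did not move, and the combination the slides do not cover (no uptrend, offsets not moving) is treated as OK.

```python
def consumer_status(lag_series, offset_series):
    """Classify a consumer-partition from lag and committed-offset samples over one window."""
    lag_uptrend = lag_series[-1] > lag_series[0]
    offsets_stalled = offset_series[-1] == offset_series[0]

    if lag_uptrend and offsets_stalled:
        return "STALLED"
    if lag_uptrend:
        return "WARNING"
    return "OK"


# Hypothetical windows of (lag, committed offset) samples.
print(consumer_status([5, 5, 4, 5], [100, 110, 120, 130]))
print(consumer_status([5, 10, 20], [100, 105, 110]))
print(consumer_status([5, 10, 20], [100, 100, 100]))
```

A production rule would likely look at more than the two endpoints of the window, so short dips or bumps do not flip the status.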


Lag - Observability 
Lag - Time Based

[Diagram: consumer and producer on a timeline in 1 time-minute-units, with messages at Msg_Offset: 134, Msg_Offset: 144 and Msg_Offset: 154; offset lag = 154 - 134 = 20]

Time lag = Diff (Last Consumed Offset, Last Produced Offset) / Producer Rate

Source: https://www.confluent.io/blog/kafka-lag-monitoring-and-metrics-at-appsflyer/
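
Going from absolute (offset) lag to relative (time) lag is a single division: the offset difference over the producer rate. The sketch below works through the numbers from this slide, assuming a producer rate of 10 messages per minute, which is what offsets 134, 144 and 154 on a one-minute timeline suggest. The function name `time_lag` is illustrative.

```python
def time_lag(last_consumed_offset, last_produced_offset, producer_rate):
    """Time the consumer is behind, in the same time unit as producer_rate."""
    if producer_rate <= 0:
        raise ValueError("producer_rate must be positive to express lag as time")
    return (last_produced_offset - last_consumed_offset) / producer_rate


# Consumer at offset 134, producer at offset 154, assumed 10 messages per minute.
lag_minutes = time_lag(134, 154, producer_rate=10)
print(f"consumer is about {lag_minutes} minutes behind")
```

The same offset lag of 20 would be 20 minutes on a topic receiving one message per minute, which is why the time-based view is easier to alert on.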
Trend Analysis

Keep track of high-level metrics for:
* Rate of Topic Growth ---> Do we need more partitions?
* Weekly / Monthly / Periodic Producer / Consumer Rate ---> Keeping tabs on abnormal spikes!
* TTL/Retention long enough to hold data for consumption ---> If time lag for consumer-partition goes beyond control!
* Infrastructure supporting Kafka cluster requirements ---> CPU, Memory, Network, IO capacity, GC Activity
* Zookeeper* supporting Kafka cluster requirements ---> How many topics / partitions state can be kept?
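
As one possible way to keep tabs on abnormal spikes, the sketch below flags any periodic rate sample that sits well above the mean of the preceding window. The window size, the multiplier `k` and the sample rates are assumptions for illustration; it only detects upward spikes.

```python
from statistics import mean, stdev


def find_spikes(rates, window=5, k=3.0):
    """Indexes of samples exceeding mean + k * stdev of the previous `window` samples."""
    spikes = []
    for i in range(window, len(rates)):
        baseline = rates[i - window:i]
        limit = mean(baseline) + k * stdev(baseline)
        if rates[i] > limit:
            spikes.append(i)
    return spikes


# Hypothetical weekly producer rates (messages per second).
weekly_rates = [100, 102, 98, 101, 99, 300, 100]
spikes = find_spikes(weekly_rates)
print("spike at index:", spikes)
```

Note that a spike inflates the baseline for the samples that follow it, so a second spike right after the first can go unnoticed with this simple rule.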